Projecting 1 km-grid population distributions from 2020 to 2100 globally under shared socioeconomic pathways

Spatially explicit population grids can play an important role in research on climate change, resource management, sustainable development and other fields. Several gridded datasets already exist, but global data, especially high-resolution data on future populations, are largely lacking. Based on the WorldPop dataset, we present a global gridded population dataset covering 248 countries or areas at 30 arc-seconds (approximately 1 km) spatial resolution with 5-year intervals for the period 2020–2100, produced by implementing a Random Forest (RF) algorithm. Our dataset is quantitatively consistent with the national population totals of the Shared Socioeconomic Pathways (SSPs). The projected dataset is validated by comparing it with the WorldPop dataset at both the sub-national and grid levels. In total, 3569 provinces (almost all provinces on the globe) and more than 480 thousand grids are included in the verification, and the results show that our dataset can serve as an input for predictive research in various fields.


Table columns: Dataset, Time Span, Resolution, Sources
Historical population data on the globe
www.nature.com/scientificdata

Methods
We predict the global spatially explicit population grid from 2020 to 2100 by building an RF model based on spatial path dependence. Spatial path dependence 28 reflects the influence of initial or early conditions on how a process evolves; here, it means that the population distribution at time T2 is assumed to be affected by the distribution at time T1 as well as by other environmental factors. Following existing population projection methods 17,22,23 (calculating population potential surfaces and then allocating administrative-level population), we use a random forest algorithm to calculate the population potential surfaces because of its excellent predictive performance and wide application in population prediction 23,29 . Our method involves three procedures: (1) preparation before projection: because there are large disparities between regions of the globe, we divide countries/territories into 8 regions 30 and randomly sample enough points in these regions to develop our RF models based on 2015 WorldPop (see section Sampling method); (2) building the projection model: we train an appropriate model for each region and calculate population potential surfaces (see section RF model development); and (3) future projection: we project at 5-year intervals for each region under the five SSPs (see section Future projection). The framework of this research is shown in Fig. 1, and the details of each procedure are explained below.
WorldPop dataset. The WorldPop project 31 provides global gridded population data at a resolution of 30 arc-seconds (~1 km at the equator). WorldPop's strength is that its model is able to identify significant relationships from incoming census data and ignore rural areas without obvious satellite-derived built-up areas 32 . WorldPop also makes all source code publicly available and its methods transparent, and it integrates various inputs and auxiliary data so that models can use different weights to redistribute populations between census or administrative unit counts 33 . One of the major weaknesses and criticisms of WorldPop is that its model has no constraints other than water bodies, so the dataset dasymetrically redistributes the population of each administrative unit across the whole unit area, not just within the grid cells classified as "built-up".
Based on the strengths and weaknesses of the WorldPop dataset, on the comparative analysis of released global gridded population datasets (GPW, GHS-POP, WorldPop, and LandScan) by Yin et al. 34 , and on the availability of data time series, we decided to use the unconstrained global population grids as the population input data for this study.
Other source datasets. Existing studies have shown that the spatial distribution of population is affected by a combination of factors such as economy, policy, environment, and resources 20,23 . Therefore, considering data availability, several environmental factors widely used in existing research 20,21,35–38 were taken as input datasets for our spatial projection, including travel time to cities 38 , DEM, slope, distance to road, distance to cities, Global Land Cover (which mainly reflects natural conditions) 39 and the Global Urban Land Use Change Product (GULCP), the world's first 1-km resolution maps of future global urban land predicted under the SSP framework using the FLUS model. The high-resolution GULCP preserves spatial details and can avoid distortions in global urban land patterns 40 . The predicted paths of future urban development differ significantly among the five scenarios: SSP5 has an increasing trend and the largest urban land area, whereas SSP2 and SSP3 show similar trends to SSP5 but with much smaller urban land areas. For SSP1 and SSP4, urban land demand is expected to decline in the 2080s and 2070s, respectively, due to a hypothetical slowdown in socioeconomic growth 40 . Chen et al. 40 compared the projections with three existing representative global urban land projections, and the results show that GULCP is precise and has high resolution, which can support research in related disciplines such as ecological protection, urban climate and global climate change. Furthermore, the population distribution surrounding each grid was also taken into consideration, following existing research 23,40 . The source datasets used for the global spatially explicit population projection are summarized in the table of source datasets.
Shared socioeconomic pathways scenario (SSPs). The SSPs used in this study are a set of future pathways of societal development developed for use in global climate change research 3,41 . The SSPs describe five alternative outcomes of trends in demographics, economic development, urbanization and so on, provided by the International Institute for Applied Systems Analysis (IIASA) 41,42 . The five population scenarios are colloquially named SSP1 (Sustainability), SSP2 (Middle of the Road), SSP3 (Regional Rivalry), SSP4 (Inequality), and SSP5 (Fossil-fuelled Development) (Table 3) 21 . This study follows the population projections made by IIASA 42 and the urban land expansion projections made by Chen et al. 40 to simulate future population changes for the globe. The SSP dataset and more research on the SSPs can be found at the following link: https://iiasa.ac.at/web/home/research/researchPrograms/Energy/SSP_Scenario_Database.html.
Sampling method. Because the population grid contains a huge number of pixels, sampling across sub-regions is necessary before prediction. Little research exists on how to sample population grids scientifically, so we tried several sampling methods (random, cluster, systematic, and stratified random sampling 43 ) to explore which was most suitable for this work. The experiments showed that, because population distribution on the globe is extremely uneven, systematic and random sampling return a large number of noise grids (sparsely populated grids), which reduces the explanatory power of the RF model. Cluster sampling selects grids that are all concentrated in a certain area, which is not conducive to prediction for the globe. Chen et al. 23 proposed a stratified random sampling method that divides explicit population grids into four kinds of 250 km blocks (i.e., high-density, medium-density, low-density, and sparsely populated) and collects sample points in the first three kinds 23 . They allocated 2,000 points equally to each block for building the machine learning model and obtained reliable projection data. Although the sample placement (the distribution of 250 km blocks) may have more effect on accuracy than the sampling method 23 , the representativeness of each block was enhanced by considering whether there were significant cities within the block.
Fig. 1 caption (partial): Multiple input datasets are extracted as a table based on the EU samples. These values are divided into train and test sets for the EU RF model, and the trained model is used to produce EU population potential surfaces. SSPs are used as a total population constraint at the national level. In procedure three (Future projection), we conduct cyclical projections according to the time series (5-year intervals) for EU. All 8 regions are then predicted as in procedure two. Finally, we merge the results to obtain the final population projections for the globe.
(2022) 9:563 | https://doi.org/10.1038/s41597-022-01675-x
The sampling method is as follows (see Fig. 2). First, we tessellate the territory of the 8 regions into 250 km blocks and calculate the population density of each block. Second, we classify the blocks of each region by density into more than 4 types and select sufficient 250 km blocks for the 8 regions, ensuring that there is at least one important city (capital, provincial capital or economic center) inside each block and considering its spatial location (blocks should be evenly distributed rather than clustered in a certain area). We select 6 blocks for each region (2 high-density, 2 medium-density and 2 low-density blocks) 23 . However, because of the massive population of SEA (more than 3 billion in 2020) and the small population of OC (about 30 million in 2020), we adjust the number of blocks and select 3 blocks in OC (1 of each density type) and 12 blocks in SEA (4 of each density type). Third, 2000 points are sampled randomly in each block for building the RF model. The third step is highly robust, as shown in the validation part (see Supplementary Table 2). To reduce the risk of oversampling from lightly populated areas, we conduct a statistical analysis, and Fig. 3 demonstrates that these sampling points are reliable. Finally, we use the 8 regional datasets to build our RF models.
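As a rough illustration of the stratified block sampling described above, the following Python sketch classifies populated blocks into three density classes, picks a fixed number of blocks per class, and draws random points inside each chosen block. The function names, the equal-sized density classes and the purely random block choice are assumptions made for illustration; in the study, blocks were also screened for important cities and even spatial coverage, which this sketch does not model.

```python
import random


def classify_blocks(block_density, n_classes=3):
    """Rank populated blocks by density and split them into equal-sized
    classes (0 = low, 1 = medium, 2 = high). Blocks with zero density are
    treated as sparsely populated and left out (assumption)."""
    populated = sorted((d, b) for b, d in block_density.items() if d > 0)
    classes = {}
    for rank, (_, block) in enumerate(populated):
        classes[block] = min(rank * n_classes // len(populated), n_classes - 1)
    return classes


def select_blocks(classes, blocks_per_class, rng):
    """Pick blocks at random within each density class. This stands in for
    the manual selection by city importance and spatial spread."""
    chosen = []
    for c in sorted(set(classes.values())):
        members = sorted(b for b, k in classes.items() if k == c)
        chosen += rng.sample(members, min(blocks_per_class, len(members)))
    return chosen


def sample_points(block_cells, chosen_blocks, points_per_block, rng):
    """Draw random grid cells (without replacement) inside each chosen block."""
    samples = {}
    for block in chosen_blocks:
        cells = block_cells[block]
        samples[block] = rng.sample(cells, min(points_per_block, len(cells)))
    return samples


# Hypothetical region with ten 250 km blocks of 250 x 250 one-km cells each.
rng = random.Random(42)
density = {f"b{i}": float(i) for i in range(10)}
cells = {block: range(250 * 250) for block in density}

classes = classify_blocks(density)
chosen = select_blocks(classes, blocks_per_class=2, rng=rng)
samples = sample_points(cells, chosen, points_per_block=2000, rng=rng)
print(len(chosen), sum(len(v) for v in samples.values()))
```

With 2 blocks per density class and 2000 points per block, a region contributes 6 blocks and 12,000 sample points.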

RF model development.
We build a separate RF model for each of the 8 regions; EU is taken as an example (Fig. 1). Based on the 12,000 EU sample points, the values of all input datasets are extracted as a table. These values are divided into a training set (80%) and a test set (20%) for EU RF model training. We train each model 20 times and select the most accurate one for producing the EU population potential surfaces. The performance of each RF model is verified (see Technical Validation). We exclude uninhabitable areas and take the SSPs as the total constraint at the national level. Moreover, the Urban Land Use dataset produced by Chen et al. 40 , which predicts future urban expansion (2020-2100) under the five SSP scenarios, is also used as input data. Its values change as the projection advances through the 5-year time steps, which helps to better simulate the development of future population distribution.
Future projection. In this procedure, we conduct cyclical projections along the time series (5-year intervals) for all regions. The population distribution (WorldPop dataset), the SSP population projections at the country level and the Urban Land Use dataset change over time as input datasets for simulating the SSPs. Finally, we merge the population projection results of the 8 regions to obtain the final projection dataset for the globe.
However, the population data provided by the SSPs (188 countries or areas in this research) do not cover every country and area on the globe. For the 60 countries or areas without SSP projection data, we skip the population adjustment step. The final population dataset therefore covers 248 countries or areas on the globe. The list of countries is shown in Supplementary Table 3.
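The national-level constraint and the handling of countries without SSP data can be sketched as follows. This is a minimal illustration, assuming that the constraint is applied by rescaling each country's potential surface proportionally so that its habitable cells sum to the SSP national total; the text does not spell out the exact rule, and the function names, the country code and the numbers are hypothetical.

```python
def constrain_to_national_total(potential, national_total, habitable=None):
    """Rescale potential values so that the cells sum to national_total.
    Uninhabitable cells and non-positive potentials receive zero population."""
    if habitable is None:
        habitable = [True] * len(potential)
    weights = [p if ok and p > 0 else 0.0 for p, ok in zip(potential, habitable)]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("no habitable cell with positive potential")
    return [national_total * w / total_weight for w in weights]


def adjust_country(country, potential, ssp_totals, habitable=None):
    """Apply the national constraint when an SSP total exists; otherwise
    skip the adjustment and return the unadjusted values (assumption:
    the raw potential values are kept as they are)."""
    if country not in ssp_totals:
        return list(potential)
    return constrain_to_national_total(potential, ssp_totals[country], habitable)


# Hypothetical country "AAA" with four cells, one of them uninhabitable.
ssp_totals = {"AAA": 600.0}
cells = [1.0, 3.0, 4.0, 2.0]
mask = [True, True, False, True]

adjusted = adjust_country("AAA", cells, ssp_totals, mask)
skipped = adjust_country("BBB", cells, ssp_totals, mask)
print(adjusted, skipped)
```

Proportional rescaling keeps the relative pattern of the potential surface while forcing the national sum to match the scenario total, which is the usual way such a constraint is imposed in dasymetric allocation.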
Finally, we compare the differences between the five SSPs using two example areas in 2100, as shown in Fig. 4. The figure shows that the population distribution differs substantially across the five scenarios. The future development of population is complex and results from the interaction between a country's total population and its urbanization pattern. Under the SSP3 scenario, the population of Paris may shrink by 2100 compared with 2020 because of the decrease in France's population, whereas the population of New Delhi and surrounding cities may increase. The population of India in 2100 is essentially the same under the SSP1 and SSP5 scenarios. Under SSP5, however, the urban area of New Delhi and surrounding cities may expand more widely than under SSP1. This could increase the population of areas close to cities while decreasing the population of areas farther away. Governments, organizations, or researchers can use this dataset under different scenarios according to their research objectives, such as sustainable development, global climate change, and energy consumption.

Data Records
The projected gridded global population data under five SSP scenarios from 2020 to 2100 are stored as GeoTIFF files (.tif) with the WGS84 projection at approximately 1 km (30 arc-seconds) resolution. These can be freely and publicly accessed at Figshare

Technical Validation
The technical validation of our dataset is performed in four parts: (1) robustness test for the sampling method, (2) performance of the RF models on test sets, (3) comparison of predicted values and observed values, and (4) comparison of our dataset with published related datasets. Because of the available input datasets, the third comparison can only be made for 2020, whereas the last can be made for both 2020 and future years.
We use MAE (Mean Absolute Error), which reflects the overall accuracy of the projections, RMSE (Root-Mean-Square Error), which reflects the bias of the projections, and %RMSE, which eliminates the influence of population size on RMSE, to verify our projection at the sub-national level. These metrics are commonly used to evaluate the accuracy of population projections. The equations for the indicators are as follows, where $y_{i,pre}$ and $y_{i,obs}$ represent the predicted and observed values for grid $i$, respectively, $n$ is the number of grids, and $\bar{y}_{obs}$ represents the mean value of the observed dataset.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_{i,pre} - y_{i,obs}\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i,pre} - y_{i,obs}\right)^{2}}$$

$$\%\mathrm{RMSE} = \frac{\mathrm{RMSE}}{\bar{y}_{obs}} \times 100\%$$

The robustness test for the sampling method is reported in Table 4 for EU and Supplementary Table 2 for all 8 regions. The results are stable, which shows that our sampling method is robust. Table 5 shows the performance of the 8 RF models on their test sets. The Number of Trees (a hyperparameter of the RF model) is 500 for all 8 models, the same as in existing studies 23 . The %RMSE of our models ranges from 7.65% to 47.85%, a similar level to the results of Chen et al. 23 for China (7.78%-24.84%).
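The three indicators used throughout this validation can be computed with a few lines of Python. This is a minimal sketch assuming the standard definitions of MAE and RMSE, with %RMSE taken as RMSE divided by the mean observed value; the function names and the sample numbers are illustrative only and are not taken from the dataset.

```python
import math


def mae(pred, obs):
    """Mean absolute error between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)


def rmse(pred, obs):
    """Root-mean-square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def pct_rmse(pred, obs):
    """RMSE expressed as a percentage of the mean observed value."""
    mean_obs = sum(obs) / len(obs)
    return rmse(pred, obs) / mean_obs * 100.0


# Illustrative values for three spatial units (not real data).
predicted = [110.0, 190.0, 330.0]
observed = [100.0, 200.0, 300.0]

print(mae(predicted, observed))
print(rmse(predicted, observed))
print(pct_rmse(predicted, observed))
```

Because %RMSE is normalized by the mean observed population, it allows regions with very different population sizes to be compared on the same scale.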
Before validation, we first adjust our dataset for 2020. As described in RF model development, we take the SSP population projections as the total constraint at the national level, but the observed values follow the assumptions of the WorldPop dataset, not the SSPs. To eliminate the influence of this difference on the technical validation, we adjust our dataset according to the national population aggregated from WorldPop 2020 and regard the result as the predicted values for further validation. At the sub-national level, 3569 provinces are included in the verification (as shown in Fig. 5, each red point represents one province). At the grid level, we sample 100,000 points randomly in each region (including numerous sparsely populated points). Points with a population of less than 1 are eliminated, and we make sure that each region has more than 50,000 points in the verification (as shown in Fig. 5, each blue point represents one population point randomly selected from each region). Table 6 shows the projection errors at both the sub-national and grid levels, obtained by comparing predicted and observed values (WorldPop 2020), and the distributions of these values are shown in Fig. 5. The %RMSE values of the 8 regions range from 5.51% to 59.73% (Table 6, sub-national level), which is acceptable compared with the results of Sorichetta et al. 37 (%RMSE values of 52.96%-259.81% for LA sub-national administrations population projection). Compared with the validation results of Boke-Olén et al. 21 (RMSEs of 26,917-1,162,510 for SSA sub-national administrations), our RMSE for SSA (313,968.62) falls within the same range, indicating comparable accuracy. The MAE of our dataset for SEA at the grid level is 55.57 (Table 6, ~1 km grid level), which is nearly equal to the validation results of Chen et al. 22 (49.7-58.2). All these comparisons demonstrate that our predictive method and global gridded population projection products are reliable and can support research in other fields.
Comparison with other datasets. Existing related datasets, including projection datasets for the globe 15,16 and for regions 21,22 , are used for comparison. Figure 6 shows that our dataset appears to fit current remote sensing imagery better than the other datasets and is smoother than the city-level datasets in Africa and China. This means that our dataset offers the possibility of comparing population development patterns at the city scale under different SSP scenarios. A preliminary discussion is given in the Future projection part.
Strengths, limitations and uncertainties. The first strength of this dataset is the application of machine learning methods, which can identify the key relationships between different input datasets. The second strength is the continuous time series: the dataset is designed for comparison over time. The third strength is adaptability to other studies. Some input datasets (GULCP and the SSP population projections at the national scale) change over time, which means that our projections are consistent with these studies. The fourth strength is that this population projection matches satellite imagery better than other related studies, which means that the dataset can be used to examine differences in population development patterns under the 5 SSP scenarios.
However, our study still has some limitations. First, although this dataset is capable of demonstrating different population patterns among the 5 SSP scenarios for the same city, it does not consider the urbanization rate. This means that this dataset and other urban land cover datasets (e.g., GULCP) should not be used together to calculate urbanization rates. Second, the WorldPop population grids for 2015 and 2020 may be based on the same underlying input population data 45 , which may cause the validation results (especially those in Fig. 5 and Table 6) to appear better than they are.
Moreover, there are still method and policy uncertainties in this study that may affect the predicted results. For method uncertainties, first, the interval of GULCP is 10 years, but the interval of our projection data is 5 years, so we had to use GULCP 2020 instead of GULCP 2025 as the urban land use input to predict the population distribution in 2025. Second, the RF model of USC has a low %RMSE value on the test set (Table 5), but its overall projection result is not ideal (Fig. 5), indicating that the model may be affected by noisy data or that the samples are not sufficiently representative, which requires further research. The models of the other regions do not show this problem.
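The mismatch between the 10-year GULCP interval and the 5-year projection interval can be illustrated with a small Python sketch. It assumes that GULCP layers exist for 2020, 2030, ..., 2100 and that every mid-decade projection year falls back to the most recent earlier layer; the text only states this explicitly for 2025, so applying the same rule to later years is an assumption, and the function names are hypothetical.

```python
def projection_years(start=2020, end=2100, step=5):
    """Projection years at 5-year intervals, both ends included."""
    return list(range(start, end + 1, step))


def gulcp_year_for(year, first=2020, last=2100, interval=10):
    """Most recent GULCP layer year that is not later than the given
    projection year (assumed fallback rule)."""
    if not first <= year <= last:
        raise ValueError("year outside the GULCP time span")
    return year - (year - first) % interval


# Pair every projection year with the urban land layer it would use.
pairs = [(year, gulcp_year_for(year)) for year in projection_years()]
print(pairs[:4])
```

Under this rule, half of the projection years reuse an urban land layer that is five years old, which is one source of the method uncertainty noted above.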
For policy uncertainties, China has implemented population ceiling policies in mega-cities, so their population growth may be limited. The model in this study does not consider the impact of policy factors on population distribution. In addition, due to ethnic, energy, and territorial issues, some countries such as Afghanistan, Israel, and Iraq are affected by war year-round, and their population changes lack regularity. Moreover, diseases, natural disasters and other emergencies will change the spatial distribution of population at different levels. For example, the COVID-19 pandemic, which erupted globally in 2020, spread rapidly with a high fatality rate, and the differing severity of the pandemic across countries may lead to a redistribution of the population. While our projection method is a general one, based on the historical population distribution and the SSP scenarios, it does not yet consider such specific impacts.

Usage Notes
Based on the WorldPop dataset, the SSP population projections and other related covariates, we provide a range of future population projections from 2020 to 2100 at 5-year intervals. Each projection product gives the spatial distribution of population at approximately 1 km (30 arc-seconds) spatial resolution. Given the large need for gridded global population projections and for a better understanding of demographic trends, we produce a set of quality projections and make both the code and the population projection products available to a wide audience.
To verify the accuracy of the population projection data, we evaluate the predicted population at both the sub-national and grid levels using MAE, RMSE and %RMSE. The verification results show that our population projection product has small deviations in most areas of the world and can reasonably reflect future population changes and distributions.

Code availability
The global gridded population dataset was created using Python 3.9.7 and the ArcGIS 10.6 software platform. The code for the key steps can be downloaded from Figshare (https://doi.org/10.6084/m9.figshare.19609356.v3) 46 .